Dublin Data Science Meetup | 17th May 2017

From Adaboost to XGBoost

The headlines

1995

Yoav Freund & Robert E. Schapire introduce AdaBoost


  • An enhancement to previous boosting meta-algorithms for classification
  • A linear combination of classifiers (weak learners) that can be shown to converge to a strong learner as long as each classifier added is at least slightly better than random guessing

1998

A Statistical View of Boosting (Friedman, Hastie & Tibshirani)


Regarding Adaboost

"For many classification algorithms, this simple strategy results in dramatic improvements. We show that this seemingly mysterious phenomenon can be understood in terms of well know statistical principles…"


This statistical view paved the way for Gradient Boosting, introduced shortly afterwards in Friedman's Greedy Function Approximation

2008

2013

JMLR Workshop and Conference Proceedings 42:19-55, reviewing the Higgs Boson Challenge:

…winning method used an ensemble of deep neural networks… computationally expensive…


A special "High Energy Physics meets Machine Learning" award was given to team Crowwork (Tianqi Chen and Tong He)… a slightly lower score, but they provided a Boosted Decision Tree method that is a good compromise between performance and simplicity, and which could improve tools currently used in high-energy physics

2016

Boosting

Some History

  • Idea of Boosting goes back to at least 1990 Schapire paper The Strength of Weak Learnability

  • Adaboost algorithm introduced in A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting (Freund & Schapire 1995)

  • The idea is to produce a strong learner by combining (adding together) a number of weak learners

  • A learner is any classification (or regression) rule

General Boosted Model Form

For a simple two-class (\(\{-1,1\}\)) classification problem, a Boosted Classifier has the form


\[ G(x) = \alpha_{1}G_{1}(x) + \alpha_{2}G_{2}(x) + \dots + \alpha_{M}G_{M}(x) \]

combining \(M\) simple classifiers \(G_m\) with weights (\(\alpha_m\)) to create a majority predictor \(G\).


The predicted class is \(1\) if \(G(x) > 0\), and \(-1\) otherwise.
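As a toy illustration, prediction is just the sign of the weighted vote. (The three stumps and their weights below are made up for the example, not fitted to any data.)

```python
# Three hypothetical weak learners (threshold stumps on a 1-D input) and
# made-up weights alpha_m; none of these come from a fitting procedure.
G1 = lambda x: 1 if x > 0.0 else -1
G2 = lambda x: 1 if x > 0.5 else -1
G3 = lambda x: 1 if x < 0.8 else -1

ensemble = [(0.9, G1), (0.6, G2), (0.4, G3)]   # (alpha_m, G_m) pairs

def predict(x, ensemble):
    """Weighted majority vote: the sign of the weighted sum of learners."""
    score = sum(alpha * G(x) for alpha, G in ensemble)
    return 1 if score > 0 else -1
```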

Fitting a Boosted Model

  • Freund & Schapire's approach was to fit the ensemble model in a stagewise fashion (one-by-one)

  • Each stage would concentrate on cases that had, up to that point, been classified incorrectly (adaptive)

  • The weight of each new learner would depend on how effective it is

AdaBoost.M1 Basic Algorithm

  1. Initialise the weights of all cases to \(w_{i} = 1/N\) for all \(i\) from 1 to \(N\)
  2. For m = 1 to M:

    • Find a weak learner and classify your training set
    • Compute how the learner did (error rate)
    • Use the error rate to set the weight of the learner \(\alpha_{m}\)
    • Increase the importance of the still misclassified cases

AdaBoost.M1 Formulae

Classifier \[h_m(x) \in \{-1,1\}\]

Error \[e_{m} = E_{w}\left[1_{(y\neq h_m(x))}\right] \]

Model Weight \[\alpha_{m} = \frac{1}{2}\ln\left(\frac{1-e_{m}}{e_{m}} \right)\]

Case Reweighting \[ w_{i,m+1} = w_{i,m}\exp\left(\alpha_{m}\cdot 1_{(y_i\neq h_m(x_i))}\right) \] (with the weights renormalised to sum to 1 after each step)

AdaBoost.M1 Update Step


Set \[H_m(x) = H_{m-1}(x) + \alpha_m h_m(x)\]
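The whole algorithm fits in a short sketch. This is a minimal, illustrative Python implementation that selects each round's weak learner from a small fixed pool of threshold stumps (rather than fitting one), purely to show the weighting mechanics:

```python
import math

def adaboost_m1(X, y, stumps, M):
    """AdaBoost.M1 for labels y in {-1, +1}, choosing each weak learner
    from a fixed pool of candidate classifiers (functions x -> {-1, +1})."""
    N = len(X)
    w = [1.0 / N] * N                                  # 1. uniform case weights
    ensemble = []
    for _ in range(M):
        # weighted error of a candidate learner under the current weights
        def werr(h):
            return sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        best = min(stumps, key=werr)
        err = werr(best)
        if err >= 0.5:                                 # no better than chance: stop
            break
        err = max(err, 1e-12)                          # guard against a perfect fit
        alpha = 0.5 * math.log((1 - err) / err)        # model weight
        ensemble.append((alpha, best))
        # increase the importance of still-misclassified cases, then renormalise
        w = [wi * (math.exp(alpha) if best(xi) != yi else 1.0)
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def ada_predict(x, ensemble):
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

# toy data: a threshold at 2.5 separates the two classes
X = [0, 1, 2, 3, 4, 5]
y = [-1, -1, -1, 1, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in (0.5, 1.5, 2.5, 3.5, 4.5)]
ens = adaboost_m1(X, y, stumps, M=3)
```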

Statistical view of boosting

Gradient Descent

  • For a function \(f(x)\) with a real-valued input \(x\), the gradient descent update rule is

\[x^{(m+1)} = x^{(m)} - \Delta \cdot \frac{df}{dx}\left(x^{(m)}\right)\]

where \(\Delta\) is the step size.
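A minimal numeric sketch of this rule, minimising \(f(x) = (x-3)^2\):

```python
def grad_descent(df, x0, step, iters):
    """Repeated update x <- x - step * f'(x)."""
    x = x0
    for _ in range(iters):
        x = x - step * df(x)
    return x

# minimise f(x) = (x - 3)^2, whose derivative is 2(x - 3); the minimum is at x = 3
x_min = grad_descent(lambda x: 2 * (x - 3), x0=0.0, step=0.1, iters=100)
```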

Gradient Descent

  • For a loss \(\ell(f)\) whose input is itself a function, the same update rule reads

\[f_{m+1}(x) = f_{m}(x) - \Delta \cdot \frac{d\ell}{df_{m}}(x) \]

Gradient Descent, Least Squares Loss

  • Squared-Error loss function \[ \ell(x_{i},y_{i},f) \propto \frac{1}{2}(y_i-f(x_i))^2 \]

  • Then the derivative \[ \frac{d\ell}{df} = -(y_i-f(x_i)) = -r_i \] suggests adding a function that maps each \(x_i\) to its residual \(r_i\).

Gradient Descent, Least Squares Loss

  • So for Squared-Error loss, the gradient tells us to update by adding a function that maps \(x_i\) to its current residual \(r_i\).

  • We find the function from our basis set (trees, linear models, etc.) which brings us as close as possible to returning the residuals (i.e. we model the residuals)

  • We then update our model, descending the loss gradient: \[f_m(x) = f_{m-1}(x) + \alpha_m b_m(x)\]
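A minimal Python sketch of this loop, using least-squares regression stumps fitted to the residuals at each stage (an illustrative implementation, not any package's internals):

```python
def fit_stump(X, r):
    """Least-squares regression stump: one split, leaf values = residual means."""
    best_sse, best = float("inf"), None
    for t in sorted(set(X))[:-1]:          # candidate split points
        left  = [ri for xi, ri in zip(X, r) if xi <= t]
        right = [ri for xi, ri in zip(X, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if sse < best_sse:
            best_sse, best = sse, (t, lm, rm)
    t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(X, y, M, lr=0.5):
    """Each stage fits a stump to the current residuals (the negative gradient)."""
    f = [0.0] * len(X)                     # start from the zero model
    stumps = []
    for _ in range(M):
        r = [yi - fi for yi, fi in zip(y, f)]     # residuals r_i
        b = fit_stump(X, r)
        stumps.append(b)
        f = [fi + lr * b(xi) for fi, xi in zip(f, X)]
    return lambda x: sum(lr * b(x) for b in stumps)

# toy regression: a step function is recovered by summing many shrunken stumps
X = [0, 1, 2, 3]
y = [0.0, 0.0, 1.0, 1.0]
model = gradient_boost(X, y, M=50)
```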

Other Gradients

  • Poisson: \[y_i - \exp(f(x_i))\]

  • Bernoulli: \[ y_i - \frac{1}{1 + \exp(-f(x_i))} \]

  • What AdaBoost uses (exponential loss, with \(y_i \in \{0,1\}\)): \[ (2y_i-1)\exp\left(-(2y_i-1)f(x_i)\right) \]
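These are the negative gradients of the corresponding log-likelihoods. As a sanity check, the Bernoulli entry can be verified against its loss by finite differences (a small Python sketch):

```python
import math

def bernoulli_loss(y, f):
    """Negative Bernoulli log-likelihood, with f on the log-odds scale, y in {0, 1}."""
    return math.log(1 + math.exp(f)) - y * f

def neg_gradient(y, f):
    return y - 1 / (1 + math.exp(-f))      # the Bernoulli gradient above

# the central finite difference of the loss at f should equal minus the gradient
y, f, eps = 1, 0.7, 1e-6
fd = -(bernoulli_loss(y, f + eps) - bernoulli_loss(y, f - eps)) / (2 * eps)
```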

Gradient Boosting implementation


  • Trees, or Tree Stumps are commonly used in boosted models


  • In general a variation of Gradient Descent called Stochastic Gradient Descent is used (each stage is fitted on a random subsample of the cases)

Tree Models

A tree model (Titanic data)

Example Surface

Tree Ensemble: 7 Trees, Depth 6

Tree Ensemble: 35 Trees, Depth 6

Tree Ensemble: 70 Trees, Depth 6

Tree Ensemble: 70 Trees, Depth 2

Bias and Variance

Model Bias and Variance

  • Bias and Variance have traditionally been seen as a trade off

  • Low Bias => High Variance

  • Low Variance => High Bias

  • This is true for static models but modern algorithmic methods attempt to overcome it.

Gradient Boosting vs Random Forest

  • Both are additive ensembles of trees

  • Random Forest takes a low-bias, high-variance modelling procedure (deep trees) and tries to reduce the variance

  • GB takes a low-variance, high-bias modelling procedure (e.g. stumps/shallow trees) and tries to correct the bias

  • Also, in GB, as the number of boosting steps gets larger, bias correction plays a decreasing role and averaging kicks in, often making it resistant to overfitting and variance increase (internal averaging)

GB vs RF (sketch)

Regularisation in tree ensembles

  • Regularisation constrains or dampens a model towards a baseline, counteracting complexity/variance
  • In tree ensembles, regularisation can constrain:
    • an individual tree's contribution
    • the complexity of trees (total depth)
    • the chance of individual node splitting
    • the size of values in the leaves
  • constraints can be hard limits or L1 or L2 penalties added to the loss function
  • adds bias to reduce variance
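XGBoost makes these penalties explicit: its training objective (per the XGBoost paper) adds a per-tree penalty \(\Omega\) to the loss,

\[ Obj = \sum_{i} \ell(y_i, \hat{y}_i) + \sum_{k}\Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda\lVert w\rVert^{2} \]

where \(T\) is the number of leaves in a tree and \(w\) its leaf values: \(\gamma\) is a hard cost per extra leaf, and \(\lambda\) an L2 penalty on leaf size (an L1 term \(\alpha\lVert w\rVert_{1}\) is also available).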

Gradient Boosting packages

R GBM Package

  • Tree based
  • Objective Types: Gaussian, AdaBoost, Bernoulli, Quantile Regression, Poisson, CoxPH, Laplace, Pairwise, TDist, Tweedie
  • Optimal Stopping can be tested with test set, out-of-bag, or cross-validation methods
  • Case weighting
  • The most actively maintained version is GBM3 (a fork of GBM) on GitHub

XGBoost: Extreme Gradient Boosting

  • Linear model or Tree basis functions
  • Objective types: Regression (Linear, Gamma, Tweedie), Bernoulli, Poisson, Softmax/SoftProb, Ranking (pairwise dist)
  • Custom optimisation objectives and evaluation criteria
  • Takes \(2^{nd}\) order gradient into account
  • Column Subsampling (as well as row subsampling)
  • Explicit and implicit regularisation
  • Grow/Prune Tree fitting, not limited to greedy increments
  • Case weighting, dropout (DART)
  • Early Stopping; Continue after stopping
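On the second-order point: at each stage XGBoost approximates the loss by its second-order Taylor expansion around the current prediction,

\[ Obj^{(m)} \approx \sum_{i}\left[g_{i}\,f_{m}(x_i) + \tfrac{1}{2}h_{i}\,f_{m}(x_i)^{2}\right] + \Omega(f_{m}) \]

with \(g_i\) and \(h_i\) the first and second derivatives of the loss at the current prediction; this yields closed-form leaf values and split gains for any twice-differentiable objective.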

XGBoost: Extreme Gradient Boosting

  • Python, R, Julia, Scala, C++ interfaces
  • Feature-rich for big data and parallel computing
    • Shared Memory Parallel Processing: openMP
    • Sparsity Aware Algorithm (and sparse data input)
    • Cache friendly algorithms; out-of-core learning
    • MPI style distributed computation with 'Rabit' library

Fitting a boosted model

Things to think about

  • Loss function: very important. Align it to your end objective as closely as possible.

  • Hyperparameter tuning: quite important. There are many methods: heuristics, grid search, adaptive grid search, random search, Bayesian optimisation, …

  • Feature engineering: can be important. Very specific to the data and problem.

Important tuning parameters (XGBoost)

  • Complexity: max_depth, min_child_weight and gamma

  • Subsampling: subsample, colsample_bytree

  • Learning Progression: eta, num_round

  • Imbalanced data: scale_pos_weight, max_delta_step
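To make the groupings concrete, here is an illustrative parameter dictionary of the kind passed to xgboost's native training API. The values are hypothetical starting points for tuning, not recommendations from the slides:

```python
# Illustrative XGBoost parameter dictionary grouping the knobs above.
# The specific values are hypothetical starting points, not tuned settings.
params = {
    # complexity
    "max_depth": 6,            # deeper trees capture more interactions
    "min_child_weight": 1,     # minimum sum of case (hessian) weight in a leaf
    "gamma": 0,                # minimum loss reduction required to split
    # subsampling
    "subsample": 0.8,          # row sampling per boosting round
    "colsample_bytree": 0.8,   # column sampling per tree
    # learning progression
    "eta": 0.1,                # learning rate (shrinkage per round)
    # imbalanced data
    "scale_pos_weight": 1,     # up-weights the positive class
    "max_delta_step": 0,       # caps each leaf update; can help rare positives
}
num_round = 200                # boosting rounds, passed separately to xgb.train
```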

XGBoost

Parameter Tuning Hints from Owen Zhang

The Higgs challenge

Kaggle Higgs Boson Challenge

  • Higgs boson was the final piece of the particle physics Standard Model jigsaw
  • Signals from the Higgs boson are very rare
  • The Kaggle dataset was generated from a mathematical model; the idea was to use the competition to develop and refine ML techniques to search for the signal
  • The classification problem was to find high probability regions
  • 250,000 samples in the training set with 30 predictors and a weights column
  • The data can be found on Kaggle or at http://opendata.cern.ch/record/328

Kaggle Higgs Boson Challenge

Evaluation Metric

  • The evaluation metric is the approximate median significance (AMS):

\[AMS=\sqrt{2\left((s+b+b_r)\log\left(1+\frac{s}{b+b_r}\right)-s\right)}\]

  • \(s,b\): unnormalised true positive and false positive rates, respectively
  • \(b_r=10\) is a constant regularisation term
  • \(log\) is the natural log
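The metric is straightforward to compute directly; a minimal Python sketch of the formula above:

```python
import math

def ams(s, b, b_r=10.0):
    """Approximate median significance for signal s and background b."""
    return math.sqrt(2 * ((s + b + b_r) * math.log(1 + s / (b + b_r)) - s))
```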

Kaggle Higgs Submission Scores

XGBoost 'Starter Script' was published during the competition

XGBoost Higgs R Code

Tianqi's Code (snippet)

# DMatrix with case weights; -999.0 encodes missing values in the Higgs data
xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)

param <- list("objective" = "binary:logitraw",        # raw (pre-logistic) scores
              "scale_pos_weight" = sumwneg / sumwpos, # rebalance the rare signal class
              "bst:eta" = 0.1,
              "bst:max_depth" = 6,
              "eval_metric" = "auc",
              "eval_metric" = "ams@0.15",             # AMS with a 15% selection threshold
              "silent" = 1,
              "nthread" = 16)

watchlist <- list("train" = xgmat)
nround <- 120
bst <- xgb.train(param, xgmat, nround, watchlist)

Parameter tuning with Caret

XGBoost and Caret

Parameter Grid Setup

xgb_grid_3 <- expand.grid(
  nrounds = c(150, 250),
  eta = c( 0.01, 0.1, 0.2),
  max_depth = c(4,6),
  gamma = c(0, 1),
  colsample_bytree = c(0.66, 1),
  min_child_weight = c(5,10,20)
)
nrow(xgb_grid_3)
[1] 144
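The grid size is just the product of the candidate counts (2 × 3 × 2 × 2 × 2 × 3 = 144), and with 10-fold CV every combination is fitted once per fold. A quick Python equivalent of the cross product (hypothetical, mirroring `expand.grid`):

```python
from itertools import product

# expand.grid is a full cross product over the candidate values
grid = list(product(
    [150, 250],            # nrounds
    [0.01, 0.1, 0.2],      # eta
    [4, 6],                # max_depth
    [0, 1],                # gamma
    [0.66, 1],             # colsample_bytree
    [5, 10, 20],           # min_child_weight
))
n_fits = len(grid) * 10    # 10-fold CV fits every combination once per fold
```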

XGBoost and Caret

Parameter Grid Example

nrounds  eta  max_depth gamma colsample_bytree min_child_weight
    150 0.01          4     0             0.66                5
    250 0.01          4     0             0.66                5
    150 0.10          4     0             0.66                5
    250 0.10          4     0             0.66                5
    150 0.20          4     0             0.66                5
    250 0.20          4     0             0.66                5
    150 0.01          6     0             0.66                5
    250 0.01          6     0             0.66                5
    150 0.10          6     0             0.66                5
    250 0.10          6     0             0.66                5
    150 0.20          6     0             0.66                5
    250 0.20          6     0             0.66                5
… (144 combinations in total)

XGBoost and Caret

Training Control

xgb_trcontrol_3 <- trainControl(
  method = "cv",
  number = 10,
  verboseIter = TRUE,
  returnData = FALSE,
  returnResamp = "all",            # save losses across all models
  classProbs = TRUE,               # needed for AUC to be computed
  allowParallel = TRUE,
  summaryFunction = twoClassSummary
)

XGBoost and Caret

Training Run

xgb_train_3 <- train(
  x = data,
  y = label,
  weights = weight,
  trControl = xgb_trcontrol_3,
  tuneGrid = xgb_grid_3,
  method = "xgbTree",
  metric = "ROC",                         # select on cross-validated AUC
  scale_pos_weight = sumwneg / sumwpos,   # passed through to xgboost via ...
  missing = NA
)

Hyper Parameter Tuning 1 of 4

Hyper Parameter Tuning 2 of 4

Hyper Parameter Tuning 3 of 4

Hyper Parameter Tuning 4 of 4

Caret Tuning Message

ROC was used to select the optimal model using the largest value.
The final values used for the model were
nrounds = 250,
max_depth = 6,
eta = 0.1,
gamma = 1,
colsample_bytree = 1
and min_child_weight = 20

Useful xgboost helper functions

xgb.plot.importance( … )

xgb.plot.tree( … )

xgb.dump(bst7, "dump.raw.txt", with.stats = TRUE)

booster[0]
0:[f0<50.5] yes=1,no=2,missing=1,gain=19.1203,cover=10000
  1:[f0<27.5] yes=3,no=4,missing=3,gain=6.62227,cover=5000
    3:[f0<21.5] yes=7,no=8,missing=7,gain=0.0800078,cover=2700
      7:leaf=-0.150987,cover=2100
      8:leaf=-0.159073,cover=600
    4:[f1<70.5] yes=9,no=10,missing=9,gain=5.12141,cover=2300
      9:[f1<34.5] yes=15,no=16,missing=15,gain=14.039,cover=1610
        15:[f1<27.5] yes=19,no=20,missing=19,gain=0.174287,cover=782
          19:leaf=-0.151579,cover=621
          20:leaf=-0.169178,cover=161
        16:[f0<46.5] yes=21,no=22,missing=21,gain=2.27443,cover=828
          21:[f1<62.5] yes=27,no=28,missing=27,gain=1.94476,cover=684
            27:leaf=-0.229251,cover=532
            28:leaf=-0.185909,cover=152
          22:leaf=-0.174229,cover=144
      10:leaf=-0.152837,cover=690
  2:[f0<73.5] yes=5,no=6,missing=5,gain=6.72211,cover=5000
    5:[f1<70.5] yes=11,no=12,missing=11,gain=5.20528,cover=2300
      11:[f1<35.5] yes=17,no=18,missing=17,gain=14.2643,cover=1610
        17:[f1<28.5] yes=23,no=24,missing=23,gain=0.457774,cover=805
          23:leaf=-0.147592,cover=644
          24:leaf=-0.125756,cover=161
        18:[f0<54.5] yes=25,no=26,missing=25,gain=2.48841,cover=805
          25:[f0<52.5] yes=29,no=30,missing=29,gain=0.0744164,cover=140
            29:leaf=-0.134878,cover=70
            30:leaf=-0.110093,cover=70
          26:[f1<61.5] yes=31,no=32,missing=31,gain=2.41692,cover=665
            31:leaf=-0.0678413,cover=494
            32:leaf=-0.109706,cover=171
      12:leaf=-0.146728,cover=690
    6:[f0<79.5] yes=13,no=14,missing=13,gain=0.129898,cover=2700
      13:leaf=-0.140428,cover=600
      14:leaf=-0.14887,cover=2100
booster[1] …

XGBoost Documentation

XGBoost Repo for further info

Summing up

What is good about XGBoost

  • Model:
    • Gradient Boosting, regularisation, internal averaging
    • Boosted Tree Models (locally adaptive neighbourhoods)
    • Feature sampling (like Random Forest), and row sampling (like Bagging)
    • Both Explicit Regularisation (tree size and learning rate) and Implicit Regularisation L1 and L2 (with full tree pruning)

What is good about XGBoost

  • Usability
    • Bespoke loss function can be used, and controls like early stopping/restarting
    • Any other evaluation statistics (built-in or custom; on a test set) can be collected and used for stopping rule
    • Good accessibility to the model and modelling information (text dumps; plots etc.)
  • Computation Optimisation
    • Huge bag of data-focused and machine-architecture-focused optimisations

Find out more